library(here)
library(openxlsx)
library(ggplot2)
library(ggdark)
library(dplyr)
library(data.table)
library(RColorBrewer)
library(doParallel)
library(cowplot)
library(DT)
set.seed(42)
# all measurements are in units of millimeters
amrnth <- read.xlsx(here("seeds.xlsx"))
Once all heights and widths are measured, we assemble the raw data into a table and bootstrap them (random samples with replacement, sample size = 70% of all observed data, 100k times). For each bootstrap sample per statistic, the mean and sd are calculated, collated into lists, and assembled into a dataframe (100k rows, 3 cols). This step can be slow on a local machine running in serial, so I bootstrapped in parallel using the doParallel R package.
registerDoParallel(cores = 4)
par.bootstrap <- function(sample, n, iterations) {
  # Purpose: resample a vector (size n, with replacement) `iterations` times,
  # using the doParallel backend registered above. Returns the raw resamples
  # as a list of vectors, which works well with the apply family of functions.
  b.vector <- foreach(i = 1:iterations) %dopar% {
    sample(x = sample, size = n, replace = TRUE)
  }
  return(b.vector)
}
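As a quick sanity check (a hypothetical example of mine, assuming the foreach/doParallel backend registered above; the toy vector `x` is not the real seed data), the function returns a list of `iterations` resamples, each of size `n`:

```r
x <- rnorm(20)                                   # toy data for illustration
bs <- par.bootstrap(x, n = 14, iterations = 10)
length(bs)        # one list element per iteration: 10
unique(lengths(bs))  # every resample contains 14 draws: 14
```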
B <- 1e5
n.boot <- round(0.7 * length(amrnth$Height))  # 70% of the observed data per resample
# Mean
mean.height <- lapply(par.bootstrap(amrnth$Height, n.boot, B), mean)
mean.width <- lapply(par.bootstrap(amrnth$Width, n.boot, B), mean)
# sd
sd.height <- lapply(par.bootstrap(amrnth$Height, n.boot, B), sd)
sd.width <- lapply(par.bootstrap(amrnth$Width, n.boot, B), sd)
# Visualize bootstrap statistics
p1 <- ggplot(data.frame(value = unlist(mean.height)), aes(x = value)) +
  geom_histogram(bins = 25) +
  ggtitle("bootstrapped mean of seed height") +
  xlab("mean height of seeds (mm)") +
  dark_theme_light()
p2 <- ggplot(data.frame(value = unlist(mean.width)), aes(x = value)) +
  geom_histogram(bins = 25) +
  ggtitle("bootstrapped mean of seed width") +
  xlab("mean width of seeds (mm)") +
  dark_theme_light()
p3 <- ggplot(data.frame(value = unlist(sd.height)), aes(x = value)) +
  geom_histogram(bins = 25) +
  ggtitle("bootstrapped sd of seed height") +
  xlab("sd of seed height (mm)") +
  dark_theme_light()
p4 <- ggplot(data.frame(value = unlist(sd.width)), aes(x = value)) +
  geom_histogram(bins = 25) +
  ggtitle("bootstrapped sd of seed width") +
  xlab("sd of seed width (mm)") +
  dark_theme_light()
bootstrap.mean.sd <- plot_grid(p1, p2, p3, p4, nrow = 2)
ggsave(here("figures/figure1.png"), bootstrap.mean.sd, width = 6, height = 6, dpi = 600)
[Figure 1: bootstrapped means and sds of seed height and width]
The bootstrap distributions of the mean and sd of seed height and width appear to follow a normal distribution. This is no surprise: by the central limit theorem, the sampling distribution of a bootstrapped statistic tends toward normality, and the input data are quite homogeneous to begin with. Now we calculate a volume for each iteration and see how that turns out. This can be done in one step using dplyr's mutate function, following the equation for the volume of an oblate spheroid. Below is an example of the resulting table and the distribution of amaranth seed volumes.
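The mutate step implements the oblate spheroid volume, with the seed width taken as the equatorial diameter and the height as the polar diameter:

\[
V = \frac{4}{3}\pi a^2 c, \qquad a = \frac{w}{2},\quad c = \frac{h}{2}
\quad\Longrightarrow\quad
V = \frac{\pi}{6}\, w^2 h
\]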
amaranth <- as.data.frame(cbind(unlist(mean.height), unlist(mean.width)))
colnames(amaranth) <- c("height", "width")
amaranth <- amaranth %>%
  mutate(volume = (pi / 6) * width^2 * height)
datatable(amaranth)
bootstrapped.mean <- ggplot(amaranth, aes(x = volume)) +
  geom_histogram(bins = 100) +
  ggtitle("Distribution of average bootstrapped seed volumes (n = 14, 100k iterations)") +
  xlab(bquote('Volume'~(mm^3))) +
  geom_vline(xintercept = mean(amaranth$volume)) +
  dark_theme_light()
ggsave(here("figures/figure2.png"), bootstrapped.mean, width = 6, height = 6, dpi = 600)
[Figure 2: distribution of bootstrapped seed volumes]
From these statistics, we can see the average bootstrapped volume of an amaranth seed is 0.47 \(mm^3\), with a 95% confidence interval between 0.47 \(mm^3\) and 0.47 \(mm^3\) (the interval is so narrow that both bounds round to the mean at two decimal places).
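A percentile confidence interval can be read straight off the bootstrap distribution. A minimal sketch, assuming the `amaranth` dataframe built above:

```r
# Point estimate and percentile 95% CI from the bootstrapped volumes
mean(amaranth$volume)
quantile(amaranth$volume, probs = c(0.025, 0.975))
```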
seed_mass <- 20
biomass <- 100
n_plants <- 12
Here are quick back-of-the-envelope calculations for the yield of amaranth plants per season. Assuming I isolated 20 g of seed from 100 g of amaranth biomass (seeds + branches) across 12 plants, that is about 1.67 grams of seed per plant, or 0.2 grams of seed per gram of plant biomass.
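The per-plant and per-gram figures follow directly from the variables defined above:

```r
seed_mass / n_plants   # grams of seed per plant: ~1.67
seed_mass / biomass    # grams of seed per gram of biomass: 0.2
```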
For summer 2019, I harvested enough seeds to fill a snack-sized Ziploc bag (16.5 cm x 8.2 cm x about 2 cm deep) - that's about \(5.8092179 \times 10^{5}\) seeds. If I were to sow the seeds I harvested in one season 1 foot apart, this fictitious plantation would have 762 rows and columns (I don't recommend pulling this off, but do it in Missouri if you must). To put that into perspective, that's like clearing 3 oblong city blocks (260 m x 89 m) or 9 normal city blocks (81 m x 81 m)!
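A rough reconstruction of that seed count (my own sketch, not the original notebook code): treat the bag as a box, convert its volume to \(mm^3\), divide by the mean bootstrapped seed volume, and take the square root to get the side of a square planting grid. This naively assumes the seeds pack the bag perfectly, which is in the spirit of a back-of-the-envelope estimate:

```r
bag_volume <- 165 * 82 * 20                     # 16.5 cm x 8.2 cm x 2 cm, in mm^3
n_seeds <- bag_volume / mean(amaranth$volume)   # on the order of 5.8e5 seeds
round(sqrt(n_seeds))                            # side of the square grid, ~762
```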
PCA is a way to summarize large multidimensional datasets in two dimensions, and it is most commonly used to reveal the underlying structure of a dataset. In the next update, I will walk through a layperson's example of how PCA can help you understand your data.